Rank-Induced PL Mirror Descent: A Rank-Faithful Second-Order Algorithm for Sleeping Experts
We introduce a new algorithm, \emph{Rank-Induced Plackett--Luce Mirror Descent (RIPLM)}, which leverages the structural equivalence between the \emph{rank benchmark} and the \emph{distributional benchmark} established in \citet{BergamOzcanHsu2022}. Unlike prior approaches that operate on expert identities, RIPLM updates directly in the \emph{rank-induced Plackett--Luce (PL)} parameterization. This ensures that the algorithm's played distributions remain within the class of rank-induced distributions at every round, preserving the equivalence with the rank benchmark. To our knowledge, RIPLM is the first algorithm that is both (i) \emph{rank-faithful} and (ii) \emph{variance-adaptive} in the sleeping experts setting.
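The abstract above does not spell out the two primitives it relies on; the following is a minimal illustrative sketch (our own, not the authors' RIPLM implementation) of (i) sampling a ranking from a Plackett--Luce distribution given log-scores and (ii) an entropic mirror descent (exponentiated-gradient) update on those scores. The loss vector, learning rate, and awake-set handling are hypothetical placeholders.

import numpy as np

def sample_pl_ranking(theta, awake, rng):
    """Sample a ranking of the awake experts from the Plackett-Luce
    distribution with log-scores theta: experts are drawn sequentially
    without replacement, each with probability proportional to exp(theta)."""
    weights = np.exp(theta - theta.max())  # stabilized PL weights
    remaining = list(awake)                # sleeping experts: rank only awake ones
    ranking = []
    while remaining:
        w = weights[remaining]
        i = rng.choice(len(remaining), p=w / w.sum())
        ranking.append(remaining.pop(int(i)))
    return ranking

def mirror_descent_step(theta, loss, eta=0.1):
    """Entropic mirror descent on the PL scores: an exponentiated-gradient
    step in log-space, followed by renormalization of the induced weights."""
    theta = theta - eta * loss
    return theta - np.log(np.exp(theta).sum())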
high_prob_ls_nonconvex_final
Next, we will show that $e(x, S)$ is sub-exponential. Proposition 3. Let $g = g(x, U)$, and fix [...] In Section 2.3 of [BCCS21], it is shown that $\|\nabla F(x) - \nabla f(x)\| \le \sqrt{n}L + \sqrt{n}$ [...]. We use these facts to show that Gaussian smoothed gradients give a valid first-order oracle. First, by the triangle inequality, we have
\[
\|g(x, U) - \nabla f(x)\| \le \|g(x, U) - \nabla F(x)\| + \|\nabla F(x) - \nabla f(x)\|.
\]
To prove Lemma 2, we will first prove two additional lemmas. The first lemma shows that the number of large and successful iterations is bounded below by the number of large and unsuccessful ones, up to a constant.
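For context, a standard single-sample Gaussian-smoothing gradient estimator of the kind such arguments apply to is sketched below (a generic construction under the usual definition $F(x) = \mathbb{E}[f(x + \sigma U)]$, not necessarily this paper's exact estimator):

import numpy as np

def smoothed_gradient(f, x, sigma, rng):
    """Single-sample estimator g(x, U) = ((f(x + sigma*U) - f(x)) / sigma) * U
    with U ~ N(0, I_n). It is unbiased for the gradient of the Gaussian
    smoothing F(x) = E[f(x + sigma*U)], and concentration bounds like the
    triangle-inequality argument above relate it to the gradient of f."""
    u = rng.standard_normal(x.shape)
    return (f(x + sigma * u) - f(x)) / sigma * u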
Continual Learning in Linear Classification on Separable Data
Evron, Itay, Moroshko, Edward, Buzaglo, Gon, Khriesh, Maroun, Marjieh, Badea, Srebro, Nathan, Soudry, Daniel
We theoretically study the continual learning of a linear classification model on separable data with binary classes. We analyze continual learning on a sequence of separable linear classification tasks with binary labels. Even though this is a fundamental setup to consider, there are still very few analytic results on it, since most of the continual learning theory thus far has focused on regression settings (e.g., Bennani et al. (2020); Doan et al. (2021); Asanuma et al. (2021); Lee et al. (2021); Evron et al. (2022); Goldfarb & Hand (2023); Li et al. (2023)). We show theoretically that learning with weak regularization reduces to solving a sequential max-margin problem, corresponding to a special case of the Projection Onto Convex Sets (POCS) framework.
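To make the POCS connection concrete, here is a minimal numpy sketch (ours, not the authors' code) of the sequential scheme the abstract describes: each task's feasible set is the polyhedron of margin-1 separators, and cyclic halfspace projections (a POCS-style iteration; Dykstra's algorithm would recover the exact Euclidean projection) move the iterate into it.

import numpy as np

def project_onto_task(w, X, y, sweeps=200):
    """Drive w into {v : y_i <x_i, v> >= 1 for all i} by cyclically
    projecting onto the individual margin halfspaces."""
    for _ in range(sweeps):
        for x_i, y_i in zip(X, y):
            margin = y_i * (x_i @ w)
            if margin < 1.0:  # violated constraint: project onto its halfspace
                w = w + ((1.0 - margin) / (x_i @ x_i)) * y_i * x_i
    return w

def sequential_scheme(tasks, dim):
    """Process separable tasks one after another, projecting the previous
    iterate onto each new task's feasible set (sequential max-margin)."""
    w = np.zeros(dim)
    for X, y in tasks:
        w = project_onto_task(w, X, y)
    return w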
Dueling Bandits: From Two-dueling to Multi-dueling
Du, Yihan, Wang, Siwei, Huang, Longbo
We study a general multi-dueling bandit problem, where an agent compares multiple options simultaneously and aims to minimize the regret due to selecting suboptimal arms. This setting generalizes the traditional two-dueling bandit problem and finds many real-world applications involving subjective feedback on multiple options. We start with the two-dueling bandit setting and propose two efficient algorithms, DoublerBAI and MultiSBM-Feedback. DoublerBAI provides a generic schema for translating known results on best arm identification algorithms to the dueling bandit problem, and achieves a regret bound of $O(\ln T)$. MultiSBM-Feedback not only has an optimal $O(\ln T)$ regret, but also reduces the constant factor by almost a half compared to benchmark results. Then, we consider the general multi-dueling case and develop an efficient algorithm MultiRUCB. Using a novel finite-time regret analysis for the general multi-dueling bandit problem, we show that MultiRUCB also achieves an $O(\ln T)$ regret bound and the bound tightens as the capacity of the comparison set increases. Based on both synthetic and real-world datasets, we empirically demonstrate that our algorithms outperform existing algorithms.
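The paper's DoublerBAI, MultiSBM-Feedback, and MultiRUCB are not reproduced here; as a reference point, below is a minimal sketch of the classical two-dueling RUCB selection rule (Zoghi et al., 2014) that MultiRUCB generalizes. The exploration parameter alpha is a placeholder.

import numpy as np

def rucb_select(wins, t, alpha=0.51, rng=None):
    """One round of a basic RUCB-style rule. wins[i, j] counts wins of arm i
    over arm j; returns the (champion, challenger) pair to duel next."""
    rng = rng or np.random.default_rng()
    k = wins.shape[0]
    n = wins + wins.T
    u = np.where(n > 0, wins / np.maximum(n, 1), 0.5)          # empirical preference
    u = u + np.sqrt(alpha * np.log(t + 1) / np.maximum(n, 1))  # optimistic bonus
    np.fill_diagonal(u, 0.5)
    champs = [i for i in range(k) if np.all(u[i] >= 0.5)]  # plausible Condorcet winners
    c = int(rng.choice(champs)) if champs else int(rng.integers(k))
    col = u[:, c].copy()
    col[c] = -np.inf          # challenger must differ from the champion
    d = int(np.argmax(col))   # strongest optimistic challenger
    return c, d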
Best of Both Worlds Model Selection
Pacchiano, Aldo, Dann, Christoph, Gentile, Claudio
We study the problem of model selection in bandit scenarios in the presence of nested policy classes, with the goal of obtaining simultaneous adversarial and stochastic ("best of both worlds") high-probability regret guarantees. Our approach requires that each base learner comes with a candidate regret bound that may or may not hold, while our meta algorithm plays each base learner according to a schedule that keeps the base learner's candidate regret bounds balanced until they are detected to violate their guarantees. We develop careful mis-specification tests specifically designed to blend the above model selection criterion with the ability to leverage the (potentially benign) nature of the environment. We recover the model selection guarantees of the CORRAL [Agarwal et al., 2017] algorithm for adversarial environments, but with the additional benefit of achieving high-probability regret bounds, specifically in the case of nested adversarial linear bandits. More importantly, our model selection results also hold simultaneously in stochastic environments under gap assumptions. These are the first theoretical results that achieve best of both worlds (stochastic and adversarial) guarantees while performing model selection in (linear) bandit scenarios.
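In our notation (a hedged paraphrase, not the paper's exact statement), with candidate regret bounds $R_i(\cdot)$ and $n_{i,t}$ the number of plays of base learner $i$ before round $t$, the balancing schedule can be read as
\[
i_t \in \operatorname*{argmin}_{i \in \mathcal{A}_t} R_i(n_{i,t}),
\]
which keeps the realized candidate bounds of the active set $\mathcal{A}_t$ within a constant factor of one another, until a mis-specification test removes a learner from $\mathcal{A}_t$.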
Regret Bound Balancing and Elimination for Model Selection in Bandits and RL
Pacchiano, Aldo, Dann, Christoph, Gentile, Claudio, Bartlett, Peter
We propose a simple model selection approach for algorithms in stochastic bandit and reinforcement learning problems. As opposed to prior work that (implicitly) assumes knowledge of the optimal regret, we only require that each base algorithm comes with a candidate regret bound that may or may not hold during all rounds. In each round, our approach plays a base algorithm to keep the candidate regret bounds of all remaining base algorithms balanced, and eliminates algorithms that violate their candidate bound. We prove that the total regret of this approach is bounded by the best valid candidate regret bound times a multiplicative factor. This factor is reasonably small in several applications, including linear bandits and MDPs with nested function classes, linear bandits with unknown misspecification, and LinUCB applied to linear bandits with different confidence parameters. We further show that, under a suitable gap assumption, this factor only scales with the number of base algorithms and not their complexity when the number of rounds is large enough. Finally, unlike recent efforts in model selection for linear stochastic bandits, our approach is versatile enough to also cover cases where the context information is generated by an adversarial environment, rather than a stochastic one.
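A schematic of the balancing-and-elimination loop described above, in our own simplified rendering (the paper's exact elimination test and constants differ; the learner interface is hypothetical, with rewards assumed to lie in [0, 1]):

import numpy as np

def balance_and_eliminate(learners, bounds, T, c=1.0):
    """learners[i].play() returns a reward in [0, 1] (hypothetical interface);
    bounds[i](n) is learner i's candidate regret bound after n of its plays."""
    k = len(learners)
    n = np.zeros(k)      # plays per base learner
    s = np.zeros(k)      # cumulative reward per base learner
    active = set(range(k))
    for t in range(1, T + 1):
        # balancing: play the active learner with the smallest candidate bound
        i = min(active, key=lambda j: bounds[j](n[j]))
        s[i] += learners[i].play()
        n[i] += 1
        # simplified elimination test: if a learner's empirical average plus its
        # per-round candidate bound plus confidence still falls below the best
        # pessimistic average, its candidate bound must be violated
        conf = lambda j: c * np.sqrt(np.log(max(t, 2)) / max(n[j], 1))
        best_low = max(s[j] / max(n[j], 1) - conf(j) for j in active)
        active -= {j for j in active
                   if s[j] / max(n[j], 1) + bounds[j](n[j]) / max(n[j], 1)
                      + conf(j) < best_low}
    return s, n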
Improved Analysis of UCRL2 with Empirical Bernstein Inequality
Fruit, Ronan, Pirotta, Matteo, Lazaric, Alessandro
We consider the problem of exploration-exploitation in communicating Markov Decision Processes. We provide an analysis of UCRL2 with Empirical Bernstein inequalities (UCRL2B). For any MDP with $S$ states, $A$ actions, $\Gamma \leq S$ next states and diameter $D$, the regret of UCRL2B is bounded as $\widetilde{O}(\sqrt{D\Gamma S A T})$.
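For reference, the empirical Bernstein deviation bound (Maurer & Pontil, 2009) behind confidence intervals of this kind: using the empirical variance lets the width scale with the actual variance rather than with the range, which is one source of the tighter dependence in the bound above. A minimal sketch, assuming i.i.d. samples in [0, b] with n >= 2:

import numpy as np

def empirical_bernstein_width(samples, delta, b=1.0):
    """Empirical Bernstein deviation bound (Maurer & Pontil, 2009): with
    probability at least 1 - delta, the true mean exceeds the empirical
    mean by at most this width (samples assumed i.i.d. in [0, b], n >= 2)."""
    n = len(samples)
    var = float(np.var(samples, ddof=1))  # empirical (sample) variance
    log_term = np.log(2.0 / delta)
    return np.sqrt(2.0 * var * log_term / n) + 7.0 * b * log_term / (3.0 * (n - 1))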